Relation between College Costs and Expected Earnings

Edward Harvey, Yifan Li, Sandani Kumanayake

2023-03-27

Introduction

Question

Data

Accessed via Kaggle.com. Four files from two sources:

Variables

Weaknesses

Cleaning

  1. Starting with diversity data
    • Selected university name, state, and total enrollment variables
    • Pivoted wider by name and state to get counts for different student demographics
  2. Joined salary potential data with newly pivoted diversity data based on university name and state
  3. Jointed tuition cost data to new data frame
  4. Cleaned second tuition cost data and joined to new data frame
    • Pivoted wider to get cost breakdown by income level
    • Joined with new data frame by name and state
    • Ultimately did not use this data in the analysis
  5. Removed NAs as appropriate for individual plots
  6. Calculated percentages, geographic regions, and other minor variable changes as necessary for individual plots
setwd("/Users/sandanikumanayake/Desktop/STAT 697/HW 05/Final/")

#read in diversity dataset
diversity <- read.csv2("diversity_school.csv", sep=",")

#separate total enrollment before pivot, take distinct entries by name and state
total_enrollment <- diversity %>% select(name, state, total_enrollment) %>% 
  distinct()

#drop total enrollment from diversity
diversity <- diversity %>% select(-total_enrollment)

#convert category and state to factor (not sure if this makes a difference?)
diversity$category <- as.factor(diversity$category)
diversity$state <- as.factor(diversity$state)


#pivot demographic data (saved as new df)
diversity_pivot <- diversity %>% 
  filter(!is.na(name)) %>% 
  pivot_wider(id_cols=c(name, state), names_from=category, values_from = enrollment)

#here I rejoin the total enrollment and diversity_pivot is complete

diversity_pivot <- left_join(diversity_pivot, total_enrollment, by=c("name", "state"))


#load in the salary potential data
salary_potential <- read.csv2("salary_potential.csv", sep=",")
salary_potential <- salary_potential %>% rename("state" = "state_name")

#join diversity and salary potential, save as new df
df <- left_join(diversity_pivot, salary_potential, by=c("name", "state"))

#join df with tuition_cost
tuition_cost <- read.csv2("tuition_cost.csv", sep=",")
df <- left_join(df, tuition_cost, by=c("name", "state"))

#drop state code (redundant) and rank(?)
df <- df %>% select(-c("state_code","rank"))


#load tuition income dataset
tuition_income <- read.csv2("tuition_income.csv", sep=",")

#select 2018 only, the most recent data (other datasets do not have multiple years)
tuition_income <- tuition_income %>% 
  filter(year==2018) %>% 
  select(name, state, total_price, campus) %>% 
  distinct()

#convert to factor (is this necessary?)
tuition_income$campus[which(tuition_income$name=="Laurel Technical Institute" & tuition_income$total_price==16168)] <- 'On Campus'

tuition_income$campus <- as.factor(tuition_income$campus)
  
#pivot wider
tuition_income_pivot <- tuition_income %>% pivot_wider(names_from = campus, values_from = total_price)

#change null values to NA for On Campus
tuition_income_pivot$`On Campus`[sapply(tuition_income_pivot$`On Campus`, is.null)] <- NA

#convert state codes to state
tuition_income_pivot$state <- abbr2state(tuition_income_pivot$state)


#join 
df <- left_join(df, tuition_income_pivot, by=c("name", "state"))


colnames(tuition_income_pivot[,3:4]) <- c("on_campus","off_campus")

#save as RDS file
write.csv2(df, file="df.csv")

saveRDS(df, file="df.rds")

Plots

Costs

Expected Earnings

Scatter Plots

Shiny applications not supported in static R Markdown documents

Conclusion